Sqoop Import in HDFS

The import tool imports individual tables from an RDBMS into HDFS. Each row of a table is treated as a record in HDFS. Records are stored either as text data in delimited text files or as binary data in Avro or SequenceFile format. Here the import command is used to import a table from a MySQL database into HDFS.
Import table data to HDFS (the output file is delimited text by default)
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp;
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp -m 2;
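As noted above, the imported records can also be stored in binary form instead of text. A minimal sketch using Sqoop's --as-avrodatafile and --as-sequencefile options (the target directories here are just examples):

sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp --as-avrodatafile --target-dir=/sqoop/emp_avro -m 1;
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp --as-sequencefile --target-dir=/sqoop/emp_seq -m 1;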
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use for the import with the -m or --num-mappers argument; each takes an integer value corresponding to the degree of parallelism to employ, and each map task writes its own output file. Import performance depends on this degree of parallelism. By default, Sqoop uses four tasks in parallel. This may not be optimal, so you will need to experiment with your own setup: additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then the additional load may decrease performance.
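To see the effect of the mapper count, one simple check (assuming emp has a numeric primary key such as empid and that /sqoop/parallel_demo does not already exist) is to run the import with --num-mappers and list the resulting files; each mapper produces one part-m file:

sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp --num-mappers 4 --target-dir=/sqoop/parallel_demo;
hdfs dfs -ls /sqoop/parallel_demo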
Import table data to HDFS (Tab separated file format)
Sqoop by default uses 4 tasks in parallel.

sqoop-import --options-file /home/cloudera/Desktop/data/pram.txt --fields-terminated-by '\t' --table emp

We can set the number of mappers with the -m option.
sqoop-import --options-file /home/cloudera/Desktop/data/pram.txt --fields-terminated-by '\t' --table emp -m 1

The --delete-target-dir option removes the import's target directory in HDFS if it already exists, so the same import can be re-run without an "output directory already exists" error.
sqoop-import --options-file /home/cloudera/Desktop/data/pram.txt --fields-terminated-by '\t' --table emp --delete-target-dir -m 1
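To spot-check the tab-separated output, view the beginning of the part file; the path below assumes the default HDFS location /user/cloudera/emp for the cloudera user and a single mapper:

hdfs dfs -cat /user/cloudera/emp/part-m-00000 | head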
 
Import Data with the “SPLIT-BY” Option

If a table has no primary key, we must set the number of mappers to 1, i.e. a sequential import of the data. For a parallel import in that case, use --split-by "some column name", after which any number of mappers can be specified and that many part-m files will be generated. The --split-by option specifies the column of the table used to generate splits for the import, i.e. which column is used to divide the rows among the mappers while importing the data into your cluster. It can be used to enhance import performance by achieving greater parallelism. Sqoop creates splits based on the values in the column named with --split-by in the import command; if it is not given, the primary key of the input table is used to create the splits.

sqoop import --options-file /home/cloudera/Desktop/data/pram.txt --table emp --target-dir=/sqoop/ImportSplit --fields-terminated-by '|' --where 'esal > 22000' --split-by empid -m 1;
sqoop import --options-file /home/cloudera/Desktop/data/pram.txt --table emp --target-dir=/sqoop/ImportSplit --fields-terminated-by '|' --where 'esal > 22000' --split-by empid -m 4;
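The query Sqoop runs to find the minimum and maximum values of the split column can also be overridden with --boundary-query, which is useful when the default SELECT MIN/MAX query is expensive or when only a sub-range should be split. A sketch, assuming empid is numeric and /sqoop/ImportSplit2 is a fresh directory:

sqoop import --options-file /home/cloudera/Desktop/data/pram.txt --table emp --target-dir=/sqoop/ImportSplit2 --split-by empid --boundary-query 'SELECT MIN(empid), MAX(empid) FROM emp' -m 4;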

Import table data to HDFS (Save To Target Directory)
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp --target-dir=/sqoop/sqoop_data1 -m 1;
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp --target-dir=/sqoop/Import1 -m 2;
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root --password cloudera --table orders -m 1;
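With -m 2 in the second command above, /sqoop/Import1 should contain two part files, which can be confirmed with:

hdfs dfs -ls /sqoop/Import1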

Pass Parameter file to Sqoop
sqoop-list-tables --options-file /home/cloudera/Desktop/data/pram.txt;
cat > pram.txt
--connect
jdbc:mysql://localhost:3306/dwdev
--username
root
--password
cloudera
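In an options file each option and its value go on their own lines; empty lines are ignored and lines beginning with # are treated as comments, so the file can be annotated if needed. For example, the top of pram.txt could look like this:

# connection details for the dwdev database
--connect
jdbc:mysql://localhost:3306/dwdev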

Import Data in Append Mode

Sqoop can append data to an existing data set in HDFS by adding the --append argument. In append mode, files are first created in a temporary location; if all data is imported successfully, they are moved to the destination directory. If any error occurs, the import is considered failed and the temporarily created files are removed.
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera  --table emp --target-dir=/sqoop/sqoop_data1  -m 1  --append;

sqoop import --options-file /home/cloudera/Desktop/data/pram.txt  --table emp --target-dir=/sqoop/sqoop_data1  -m 1  --append; 
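Running the import a second time with --append should add a new set of part files to /sqoop/sqoop_data1 instead of overwriting it, which can be checked with:

hdfs dfs -ls /sqoop/sqoop_data1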

To import all tables data from RDBMS to HDFS
sqoop import-all-tables --options-file /home/cloudera/Desktop/data/pram.txt;   (tables go to the default HDFS location, one directory per table)
sqoop import-all-tables --options-file /home/cloudera/Desktop/data/pram.txt --warehouse-dir=/sqoop/data -m 1;
Note that import-all-tables takes --warehouse-dir rather than --target-dir; each table is imported into its own sub-directory under the warehouse directory.
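After the second command, the per-table sub-directories can be listed with:

hdfs dfs -ls /sqoop/data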
Import table data to HDFS (where condition)
sqoop import --options-file /home/cloudera/Desktop/data/pram.txt --table emp --target-dir=/sqoop/ImportCond --fields-terminated-by '|' --where 'esal > 22000' -m 1;
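The same filter can also be written as a free-form query with --query instead of --table plus --where. Sqoop requires the literal token $CONDITIONS in the WHERE clause and an explicit --target-dir; the column list below assumes the empid, ename and esal columns used elsewhere in this post:

sqoop import --options-file /home/cloudera/Desktop/data/pram.txt --query 'SELECT empid, ename, esal FROM emp WHERE esal > 22000 AND $CONDITIONS' --target-dir=/sqoop/ImportQuery -m 1;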

Import table data to HDFS (Import Only Specific Columns)
sqoop import --options-file /home/cloudera/Desktop/data/pram.txt -m 1 --table emp --columns 'ename,esal' --target-dir=data/sqoop_data4 --fields-terminated-by '|' ;
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp --columns empid,esal --target-dir=data/sqoop_data5 --fields-terminated-by '\t' -m 1;
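The second command above imports only the empid and esal columns, tab-separated, into the relative HDFS path data/sqoop_data5; the output can be spot-checked with:

hdfs dfs -cat data/sqoop_data5/part-m-00000 | head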
Import table data to HDFS (Exclude Specific Tables)
sqoop import-all-tables --options-file /home/cloudera/Desktop/data/pram.txt -m 1 --exclude-tables emp,student;
sqoop import-all-tables --options-file /home/cloudera/Desktop/data/pram.txt --exclude-tables emp,student --warehouse-dir=data/sqoop_data6 -m 1;
With --exclude-tables, every table in the database except the listed ones is imported.
